HaiBiostat: Series 5 of 10 -- Introduction to Selection of Variables using Bayesian Approach

Hai Nguyen

Introduction

This example is based on section 18.4 in Kruschke, 2015.

A Bayesian approach variable selection in regression analysis can be done by adding a binary parameter for each slope:
\[y_i = \beta_0 + \sum_j \delta_j \ \beta_j x_{i,j},\]

with \[\delta_j = 0, \ 1.\]

Every combination of values \(\delta_1, \ ..., \delta_k\) gives a submodel. Simple priors for delta-indicators are independent Bernoulli distributions: \(\delta \sim \text{dbern}(0.5)\). Then posterior probabilities \(P(\delta_j = 1\mid D)\) show importance of the corresponding predictors \(x_i\).

The diagram of such model is an easy generalization of regression model.

Since parameters \(\delta_j\) are discrete, we need to use JAGS: library(runjags) instead of Stan

library(runjags)

Read data

The SAT data includes:

State: 50 states,
SATV: SAT verbal; SATM: SAT math; SATT: SAT total score
Spend: average spending per pupil
PrcntTake: percentage of students who took the test
StuTeaRat: the average student teacher ratio in each state and
Salary: the average salary of the teachers

These latter four variables are also plausible predictors of SAT score.

dta <- read.csv("data/Guber1999data.csv")
names(dta)

[1] "State"     "Spend"     "StuTeaRat" "Salary"    "PrcntTake"
[6] "SATV"      "SATM"      "SATT"

DT::datatable(dta)

We now let look at the correlation matrix among 4 predictors.

dta %>%
  select(Spend, PrcntTake, StuTeaRat, Salary) %>%
  cor(.) %>%
  round(., 3)

           Spend PrcntTake StuTeaRat Salary
Spend      1.000     0.593    -0.371  0.870
PrcntTake  0.593     1.000    -0.213  0.617
StuTeaRat -0.371    -0.213     1.000 -0.001
Salary     0.870     0.617    -0.001  1.000

That Salary is strongly correlated with Spend, and therefore, a model that includes both Salary and Spend will show a strong trade-off between those predictors and consequently will show inflated uncertainty in the regression coefficients for either one. Should only one or the other be included, or both, or neither?

One more point, response variable is average high school SAT score per state. One of the predictors is amount of money spent by the state on schools. Plotting amount of money spent against SAT scores shows negative slope.

plot(dta$Spend,dta$SATT)

Thus, should we cut funding for school? Because even with the funding, SAT would be reduced disproportional with the budget spent!!

Modeling in JAGS

y <- dta[,'SATT']
x <- as.matrix(dta[ , c("Spend","PrcntTake","StuTeaRat","Salary")])

#prepare data in JAGS
dataList <- list(Ntotal = length(y),
                 y = y,
                 x = x,
                 Nx = ncol(x))

Description of the model:

\[y_i = \beta_0 + \delta_j \ \beta_j x_{i,j} + \epsilon_i \\ y_i \sim t(\mu_i, \sigma, \nu) \\ \\ \text{where mean, scale and degree of freedom, respectively, would be} \\ \mu_i = \beta_0 + \delta_j \ \beta_j x_{i,j} \\ \beta_0 \sim N(0, \frac{1}{1/2^2}) \ ; \ \beta_j \sim t(0, 1/\sigma_{\beta}^2, \nu = 1) \\ \sigma \sim Unif(L, H) \\ \nu = \gamma^{\prime} + 1, \ \text{where} \ \gamma^{\prime} \sim exp(\lambda) \\ \delta_j \sim Bernoulli(0.5) \ ; \ \sigma_{\beta} \sim \text{Gamma(shape, rate)}\]

modelString = "
    # Standardize the data:
    data {
        ym <- mean(y)
        ysd <- sd(y)
        for (i in 1:Ntotal) {
            zy[i] <- (y[i] - ym) / ysd
        }
        for (j in 1:Nx) {
            xm[j]  <- mean(x[,j])
            xsd[j] <-   sd(x[,j])
            for (i in 1:Ntotal) {
                zx[i,j] <- (x[i,j] - xm[j]) / xsd[j]
            }
        }
    }
    # Specify the model for standardized data:
    model {
        for (i in 1:Ntotal) {
            zy[i] ~ dt(zbeta0 + sum(delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) ,  1/zsigma^2 , nu)
        }
        # Priors vague on standardized scale:
        zbeta0 ~ dnorm(0, 1/2^2)
        for (j in 1:Nx) {
            zbeta[j] ~ dt(0 , 1/sigmaBeta^2 , 1) 
            delta[j] ~ dbern( 0.5 )
        }
        zsigma ~ dunif(1.0E-5 , 1.0E+1)
        
        # sigmaBeta <- 2.0 ## uncomment one of the following specifications for sigmaBeta
        # sigmaBeta ~ dunif( 1.0E-5 , 1.0E+2 )
        sigmaBeta ~ dgamma(1.1051,0.1051) # mode 1.0, sd 10.0
        # sigmaBeta <- 1/sqrt(tauBeta) ; tauBeta ~ dgamma(0.001,0.001) 
        nu ~ dexp(1/30.0)
        
        # Transform to original scale:
        beta[1:Nx] <- (delta[1:Nx] * zbeta[1:Nx] / xsd[1:Nx])*ysd
        beta0 <- zbeta0*ysd  + ym - sum(delta[1:Nx] * zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx])*ysd
        sigma <- zsigma*ysd
    }
"

parameters <- c("beta0",  "beta",  "sigma", "delta", "sigmaBeta", "zbeta0", "zbeta", "zsigma", "nu" )
runjagsMethod <- "parallel"  # change to "rjags" in case of working on 1-core CPU 
# run JAGS

runJagsOut <- run.jags(method = runjagsMethod,
                       model = modelString, 
                       monitor = parameters, 
                       data = dataList,
                       n.chains = 3, #nChains <- 3
                       adapt = 500, #adaptSteps <- 500
                       burnin = 1000, #burnInSteps <- 1000
                       sample = ceiling(15000/3), #numSavedSteps <- 15000;  nChains <- 3
                       thin = 25, #thinSteps <- 25
                       summarise = FALSE,
                       plots = FALSE)

# generate summary info for all parameters
summary(runJagsOut)

Calculating summary statistics...
Note: The monitored variable 'delta[2]' appears to be
non-stochastic; it will not be included in the convergence
diagnostic
Calculating the Gelman-Rubin statistic for 18 variables....
Error : The "modeest" package is required to calculate the mode of continuous variables

               Lower95        Median      Upper95          Mean
beta0     944.66400000  1.011275e+03  1.11231e+03  1.018969e+03
beta[1]    -0.00128211  7.803640e+00  1.91766e+01  7.314723e+00
beta[2]    -3.26471000 -2.761835e+00 -2.22909e+00 -2.753613e+00
beta[3]    -5.57611000  0.000000e+00  4.47374e-03 -6.349370e-01
beta[4]    -0.26275600  0.000000e+00  3.55510e+00  3.305153e-01
sigma      24.62630000  3.214530e+01  4.10201e+01  3.238811e+01
delta[1]    0.00000000  1.000000e+00  1.00000e+00  6.138000e-01
delta[2]    1.00000000  1.000000e+00  1.00000e+00  1.000000e+00
delta[3]    0.00000000  0.000000e+00  1.00000e+00  1.732667e-01
delta[4]    0.00000000  0.000000e+00  1.00000e+00  2.208667e-01
sigmaBeta   0.02406400  1.127925e+00  7.94953e+00  2.198454e+00
zbeta0     -0.12437500 -1.292225e-04  1.29719e-01  1.964852e-04
zbeta[1]   -7.77130000  2.129285e-01  1.10354e+01  5.264458e-01
zbeta[2]   -1.16775000 -9.878755e-01 -7.97319e-01 -9.849344e-01
zbeta[3]  -24.01480000 -8.063585e-02  2.36377e+01 -2.338195e-01
zbeta[4]  -17.73010000  1.036410e-01  1.87959e+01  3.378339e-01
zsigma      0.32913900  4.296320e-01  5.48246e-01  4.328772e-01
nu          1.91471000  2.685290e+01  9.61985e+01  3.558561e+01
                   SD Mode        MCerr MC%ofSD SSeff       AC.250
beta0     43.89744488   NA 0.4821033688     1.1  8291  0.019090161
beta[1]    7.04248924   NA 0.1017558219     1.4  4790  0.064596992
beta[2]    0.27026073   NA 0.0033981576     1.3  6325  0.053310641
beta[3]    1.74666227   NA 0.0151369398     0.9 13315 -0.009957991
beta[4]    0.99368906   NA 0.0093047567     0.9 11405  0.010317963
sigma      4.16357566   NA 0.0387315148     0.9 11556  0.016315434
delta[1]   0.48689359    1 0.0078742141     1.6  3823  0.085242124
delta[2]   0.00000000    1           NA      NA    NA           NA
delta[3]   0.37849026    0 0.0034208127     0.9 12242 -0.004375483
delta[4]   0.41484462    0 0.0038539951     0.9 11586  0.010031033
sigmaBeta  3.24829579   NA 0.0687169662     2.1  2235  0.115133738
zbeta0     0.06462281   NA 0.0005367156     0.8 14497 -0.007738634
zbeta[1]   5.53722794   NA 0.2424652404     4.4   522  0.502781355
zbeta[2]   0.09666902   NA 0.0012154802     1.3  6325  0.053310325
zbeta[3]  18.13696053   NA 0.5572712553     3.1  1059  0.319423413
zbeta[4]  14.61246772   NA 0.3817438159     2.6  1465  0.293091432
zsigma     0.05564748   NA 0.0005176588     0.9 11556  0.016315483
nu        30.03665106   NA 0.2442544957     0.8 15122 -0.003220931
               psrf
beta0     1.0000092
beta[1]   1.0001584
beta[2]   1.0002489
beta[3]   1.0006953
beta[4]   1.0006186
sigma     1.0000140
delta[1]  1.0006266
delta[2]         NA
delta[3]  1.0008028
delta[4]  1.0006483
sigmaBeta 1.0016839
zbeta0    1.0001875
zbeta[1]  1.0614080
zbeta[2]  1.0002489
zbeta[3]  1.0456877
zbeta[4]  1.0327248
zsigma    1.0000147
nu        0.9999383

# plot all params
plot(runJagsOut,
     plot.type = c("trace", "ecdf", "histogram", "autocorr"))

Generating summary statistics and plots (these will NOT be
saved for reuse)...
Calculating summary statistics...
Note: The monitored variable 'delta[2]' appears to be
non-stochastic; it will not be included in the convergence
diagnostic
Calculating the Gelman-Rubin statistic for 18 variables....
Error : The "modeest" package is required to calculate the mode of continuous variables

Find posterior probability of different combinations of predictors by calculating frequencies with which they appeared in MCMC.

trajectoriesDelta <- as.matrix(runJagsOut$mcmc[,7:10])
head(trajectoriesDelta)

     delta[1] delta[2] delta[3] delta[4]
[1,]        0        1        0        0
[2,]        1        1        0        0
[3,]        1        1        0        0
[4,]        1        1        0        0
[5,]        1        1        1        0
[6,]        0        1        0        0

Nchain <- nrow(trajectoriesDelta)

model \(SATT = \beta_0 + \beta_1 \ Spend + \beta_2 \ PrcntTake\)

(config1 <- sum(apply(trajectoriesDelta, 1, function(z) prod(z==c(1,1,0,0))))/Nchain)

[1] 0.4738667

model \(SATT = \beta_0 + \beta_1 \ Spend\)

(config2 <- sum(apply(trajectoriesDelta, 1,function(z) prod(z==c(1,0,0,0))))/Nchain)

[1] 0

model \(SATT = \beta_0 + \beta_2 \ PrcntTake\)

(config3 <- sum(apply(trajectoriesDelta, 1, function(z) prod(z==c(0,1,0,0))))/Nchain)

[1] 0.2154667

model \(SATT = \beta_0 + \beta_1 \ Spend + \beta_2 \ PrcntTake + \beta_3 \ StuTeaRat + \beta_4 \ Salary\)

(config4 <- sum(apply(trajectoriesDelta, 1, function(z) prod(z==c(1,1,1,1))))/Nchain)

[1] 0.02186667

Inclusion Probabilities

(inclSpend<-sum(trajectoriesDelta[,1]==1)/Nchain) #Spend

[1] 0.6138

(inclPrcntTake<-sum(trajectoriesDelta[,2]==1)/Nchain) #PrcntTake

[1] 1

(inclStueTeaRat<-sum(trajectoriesDelta[,3]==1)/Nchain) #StueTeaRat

[1] 0.1732667

(inclSalary<-sum(trajectoriesDelta[,4]==1)/Nchain) #Salary

[1] 0.2208667

Conclusions

Out of 4 analyzed configurations of the model the most observed is configuration 1 (0.4738667).
For configuration 4 with all predictors observation rate is just 0.0218667.
Out of 4 variables two are selected with inclusion probabilities above 0.5, those are Spend (0.6138) and PrcntTake (1).
The second variable PrcntTake is included with non-stochastic probability of 1, standard deviation of \(\delta_2\) is zero.

Warning. Posterior probabilities of inclusion may be very sensitive to prior information.

References

Bayesian Methods, UC’s lecture
Kruschke, John K. Doing Bayesian Data Analysis: a Tutorial with R, JAGS, and Stan. 2nd ed., Academic Press is an imprint of Elsevier, 2015.

Comment on this article Share:

Series 5 of 10 – Introduction to Selection of Variables using Bayesian Approach